Conversation

@yeazelm (Contributor) commented Dec 31, 2025

Issue number:

Related to: bottlerocket-os/bottlerocket#4673

Description of changes:
This builds the mps-control-daemon binary from the device plugin to enable MPS support. We have to patch its hardcoded paths for Bottlerocket, since the device plugin assumes it can write to /, which doesn't work with Bottlerocket.

This change also adds a new systemd service that starts this binary when the settings request it. Otherwise, the service runs sleep infinity so that systemd can try-restart the unit when the MPS settings change.

The change should be safe to take without bottlerocket-os/bottlerocket-kernel-kit#347 or the upcoming settings change, but the daemon will not work until the kmod is updated and the settings are properly set.
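
If the daemon does not come up, a quick sanity check of those prerequisites on the node might look like the following (a sketch; the lsmod output depends on the driver version, and the other commands appear verbatim in the testing below):

# lsmod | grep -i nvidia
# apiclient get settings.kubelet-device-plugins.nvidia
# systemctl status nvidia-mps-control-daemon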

Testing done:
Built images with the kernel change and the settings changes, and validated that a node comes up with MPS working when it is set in user data, that the services restart correctly, and that MPS can be enabled at runtime as well.

Setting MPS in user data for a g6.2xlarge, which has only one GPU

eksctl config snippet for setting it at the beginning:

    bottlerocket:
      settings:
        kubelet-device-plugins:
          nvidia:
            device-sharing-strategy: "mps"
            mps:
              replicas: 2
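
For context, a minimal sketch of where that snippet sits in a full eksctl ClusterConfig; the cluster name, region, and nodegroup fields are placeholders and not the exact configuration used for this testing:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mps-test        # placeholder
  region: us-west-2     # placeholder
nodeGroups:
  - name: gpu           # placeholder
    instanceType: g6.2xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
    bottlerocket:
      settings:
        kubelet-device-plugins:
          nvidia:
            device-sharing-strategy: "mps"
            mps:
              replicas: 2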

Results in a node reporting nvidia.com/gpu.shared:

Capacity:
  cpu:                    8
  ephemeral-storage:      81854Mi
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 31619656Ki
  nvidia.com/gpu.shared:  2
  pods:                   58
Allocatable:
  cpu:                    7910m
  ephemeral-storage:      76173383962
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 30602824Ki
  nvidia.com/gpu.shared:  2
  pods:                   58
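
Not exercised above, but for illustration, a workload would consume one of those shared replicas by requesting the renamed resource; a minimal sketch (image and names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: mps-smoke-test              # placeholder
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu.shared: 1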

Setting MPS after boot

Start with a node with no configuration for MPS:

# apiclient get settings.kubelet-device-plugins.nvidia
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "cdi-cri",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}

# systemctl status
● ip-192-168-12-91.us-west-2.compute.internal
    State: running
    Units: 458 loaded (incl. loaded aliases)
     Jobs: 0 queued
   Failed: 0 units
    Since: Wed 2025-12-31 22:32:18 UTC; 5min ago
  systemd: 257.9
  Tainted: unmerged-bin
   CGroup: /
....

# systemctl status nvidia-mps-control-daemon
● nvidia-mps-control-daemon.service - NVIDIA MPS Control Daemon
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
             /etc/systemd/system/nvidia-mps-control-daemon.service.d
             └─exec-start.conf
     Active: active (running) since Wed 2025-12-31 22:32:32 UTC; 5min ago
 Invocation: d1565c1130dc4d9e87108f540f1178da
   Main PID: 3111 (/usr/bin/sleep)
      Tasks: 1 (limit: 36988)
     Memory: 308K (peak: 1.2M)
        CPU: 5ms
     CGroup: /system.slice/nvidia-mps-control-daemon.service
             └─3111 /usr/bin/sleep infinity

Dec 31 22:32:32 ip-... systemd[1]: Started NVIDIA MPS Control Daemon.

# systemctl cat nvidia-mps-control-daemon
# /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service
[Unit]
Description=NVIDIA MPS Control Daemon
After=nvidia-k8s-device-plugin.service
Requires=nvidia-k8s-device-plugin.service

[Service]
Type=simple
ExecStart=/bin/true
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

# /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d/00-aws-config.conf
[Service]
# Set the AWS_SDK_LOAD_CONFIG system-wide instead of at the individual service
# level, to make sure new system services that use the AWS SDK for Go read the
# shared AWS config
Environment=AWS_SDK_LOAD_CONFIG=true

# /etc/systemd/system/nvidia-mps-control-daemon.service.d/exec-start.conf
[Service]
ExecStart=
ExecStart=/usr/bin/sleep infinity

The node shows one GPU:

Capacity:
  cpu:                8
  ephemeral-storage:  81854Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             31619660Ki
  nvidia.com/gpu:     1
  pods:               58
Allocatable:
  cpu:                7910m
  ephemeral-storage:  76173383962
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             30602828Ki
  nvidia.com/gpu:     1
  pods:               58

Then set MPS:

apiclient set settings.kubelet-device-plugins.nvidia.device-sharing-strategy=mps settings.kubelet-device-plugins.nvidia.mps.replicas=8


bash-5.1# apiclient get settings.kubelet-device-plugins.nvidia
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "cdi-cri",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "mps",
        "mps": {
          "replicas": 8
        },
        "pass-device-specs": true
      }
    }
  }
}

Now check the rest of the system:

# systemctl status
● ip-192-168-12-91.us-west-2.compute.internal
    State: running
    Units: 458 loaded (incl. loaded aliases)
     Jobs: 0 queued
   Failed: 0 units
    Since: Wed 2025-12-31 22:32:18 UTC; 7min ago
  systemd: 257.9
  Tainted: unmerged-bin
   CGroup: /
           ├─default
...
# systemctl status nvidia-mps-control-daemon
● nvidia-mps-control-daemon.service - NVIDIA MPS Control Daemon
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
             /etc/systemd/system/nvidia-mps-control-daemon.service.d
             └─exec-start.conf
     Active: active (running) since Wed 2025-12-31 22:39:41 UTC; 36s ago
 Invocation: 7191c5bc120246709e113d50d3ce3c54
   Main PID: 6994 (mps-control-dae)
      Tasks: 12 (limit: 36988)
     Memory: 49.1M (peak: 62M)
        CPU: 227ms
     CGroup: /system.slice/nvidia-mps-control-daemon.service
             ├─6994 /usr/bin/mps-control-daemon --config-file /etc/nvidia-k8s-device-plugin/settings.yaml
             ├─7015 nvidia-cuda-mps-control -d
             └─7021 tail -n +1 -f /run/mps/nvidia.com/gpu.shared/log/control.log

Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] Accepting connection...
Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] NEW UI
Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] Cmd:set_default_active_thread_percentage 12
Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] 12.0
Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] UI closed
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] Accepting connection...
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] NEW UI
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] Cmd:get_default_active_thread_percentage
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] 12.0
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] UI closed
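
The Cmd: lines above are the queries the device plugin sends to the control daemon; the same query can be issued by hand through nvidia-cuda-mps-control, roughly like this (a sketch; the pipe directory is a guess based on the mpsRoot and log paths shown here and may differ on the node):

# export CUDA_MPS_PIPE_DIRECTORY=/run/nvidia/mps/nvidia.com/gpu.shared/pipe
# echo get_default_active_thread_percentage | nvidia-cuda-mps-control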

# systemctl cat nvidia-mps-control-daemon
# /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service
[Unit]
Description=NVIDIA MPS Control Daemon
After=nvidia-k8s-device-plugin.service
Requires=nvidia-k8s-device-plugin.service

[Service]
Type=simple
ExecStart=/bin/true
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

# /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d/00-aws-config.conf
[Service]
# Set the AWS_SDK_LOAD_CONFIG system-wide instead of at the individual service
# level, to make sure new system services that use the AWS SDK for Go read the
# shared AWS config
Environment=AWS_SDK_LOAD_CONFIG=true

# /etc/systemd/system/nvidia-mps-control-daemon.service.d/exec-start.conf
[Service]
ExecStart=
ExecStart=/usr/bin/mps-control-daemon --config-file /etc/nvidia-k8s-device-plugin/settings.yaml

# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  mpsRoot: "/run/nvidia/mps"
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: cdi-cri
    deviceIDStrategy: index
    containerDriverRoot: "/"
sharing:
  mps:
    renameByDefault: true
    resources:
    - name: "nvidia.com/gpu"
      replicas: 8

And the node still lists the nvidia.com/gpu resource, but with zero allocatable, alongside the new shared one:

Capacity:
  cpu:                    8
  ephemeral-storage:      81854Mi
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 31619660Ki
  nvidia.com/gpu:         1
  nvidia.com/gpu.shared:  8
  pods:                   58
Allocatable:
  cpu:                    7910m
  ephemeral-storage:      76173383962
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 30602828Ki
  nvidia.com/gpu:         0
  nvidia.com/gpu.shared:  8
  pods:                   58

This is a known edge case and is similar to how time-slicing works. To avoid the leftover resource, you'd need to start with the user-data approach.

Shifting to rename-by-default=false (apiclient set settings.kubelet-device-plugins.nvidia.mps.rename-by-default=false) keeps the original nvidia.com/gpu resource name instead:

Capacity:
  cpu:                    8
  ephemeral-storage:      81854Mi
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 31619660Ki
  nvidia.com/gpu:         8
  nvidia.com/gpu.shared:  8
  pods:                   58
Allocatable:
  cpu:                    7910m
  ephemeral-storage:      76173383962
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 30602828Ki
  nvidia.com/gpu:         8
  nvidia.com/gpu.shared:  0
  pods:                   58
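
Based on the settings.yaml shown earlier, the sharing section rendered for this case should differ only in renameByDefault (a sketch of the expected rendering, not a capture from the node):

sharing:
  mps:
    renameByDefault: false
    resources:
    - name: "nvidia.com/gpu"
      replicas: 8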

And finally, setting sharing to none disables MPS:

# apiclient set settings.kubelet-device-plugins.nvidia.device-sharing-strategy=none
# systemctl status nvidia-mps-control-daemon
● nvidia-mps-control-daemon.service - NVIDIA MPS Control Daemon
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
             /etc/systemd/system/nvidia-mps-control-daemon.service.d
             └─exec-start.conf
     Active: active (running) since Wed 2025-12-31 22:44:41 UTC; 2s ago
 Invocation: 82664a64fd044762a81ecef6d1cc0462
   Main PID: 9436 (/usr/bin/sleep)
      Tasks: 1 (limit: 36988)
     Memory: 308K (peak: 1.2M)
        CPU: 4ms
     CGroup: /system.slice/nvidia-mps-control-daemon.service
             └─9436 /usr/bin/sleep infinity

Dec 31 22:44:41 ip-192-168-12-91.us-west-2.compute.internal systemd[1]: Started NVIDIA MPS Control Daemon.

And the resource goes back down to 1.

With the incompatibility checks in the template, you can see the messages preventing MIG and MPS from running at the same time:

Jan 15 16:01:19 ip-192-168-23-52.us-west-2.compute.internal systemd[1]: Starting NVIDIA MPS Control Daemon...
Jan 15 16:01:19 ip-192-168-23-52.us-west-2.compute.internal echo[11584]: MPS and MIG are not supported at the same time
Jan 15 16:01:19 ip-192-168-23-52.us-west-2.compute.internal systemd[1]: Finished NVIDIA MPS Control Daemon.
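
Given that journal output, the ExecStart drop-in rendered for the MIG-plus-MPS case presumably ends up as an echo of the warning, something like this (a sketch of the expected rendering, not a capture from the node):

# /etc/systemd/system/nvidia-mps-control-daemon.service.d/exec-start.conf
[Service]
ExecStart=
ExecStart=/usr/bin/echo "MPS and MIG are not supported at the same time"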

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@yeazelm (Contributor, Author) commented Jan 15, 2026

^ Updated the code to use the Type changes (thanks @KCSesh!) and responded to a few other comments.

There is also a new change that performs the MIG and MPS incompatibility check during template rendering. It echoes a warning that the two don't work together. This can easily be removed if NVIDIA drops this incompatibility in a future release of their device plugin.

Jan 15 16:01:19 ip-192-168-23-52.us-west-2.compute.internal systemd[1]: Starting NVIDIA MPS Control Daemon...
Jan 15 16:01:19 ip-192-168-23-52.us-west-2.compute.internal echo[11584]: MPS and MIG are not supported at the same time
Jan 15 16:01:19 ip-192-168-23-52.us-west-2.compute.internal systemd[1]: Finished NVIDIA MPS Control Daemon.

Add support for NVIDIA Multi-Process Service (MPS) control daemon,
including service configuration and device plugin updates.

Signed-off-by: Matthew Yeazel <[email protected]>
@yeazelm (Contributor, Author) commented Jan 15, 2026

^ Updated to address comments around RemainAfterExit and default noop settings.

@yeazelm merged commit 4731f9f into bottlerocket-os:develop on Jan 16, 2026. 2 checks passed.